(54) Linguistic search system

(57) A method of searching for information in a text
database, comprising: receiving (s1) at least one user
input, the user input(s) defining a natural language
expression, converting (s2, s3) the natural language
expression to a tagged form (50, 51) including
part-of-speech tags, applying (s4) to the tagged form (51)
one or more grammar rules of the language of the natural
language expression (49), to derive a regular expression
(52), and analysing (s5) the text database to determine
whether there is a match between said regular expression
(52) and a portion of said text database. An apparatus for
carrying out this technique is also disclosed. Users may
find portions of a text which match multiword expressions
given by the user. Matches include possible variations that
are relevant to the initial criteria from a linguistic point
of view, including simple inflections like plural/singular,
masculine/feminine or conjugated verbs, and even more
complex variations like the insertion of additional
adjectives, adverbs, etc. in between the words specified by
the user.



[Representative drawing: FIG. 2, flow diagram of the steps
of a linguistic search]






Printed by Jouve, 75001 PARIS (FR) 






EP 0 886 226 A1 






Description 

The present invention relates to data processing, 
and more particularly to techniques for searching for in- 
formation in a text database or corpus. 

Most of the techniques in use to retrieve a piece of 
information in a text corpus are based on substring 
search (also known as full-text search). Because this 
basic string search mechanism is weak when the user 
wants to catch more than a simple sequence of charac- 
ters, various techniques have been developed by data 
providers to enhance the substring matching. Examples 
include wildcards, regular expressions, Boolean opera- 
tors, proximity factor (e.g. words must be in the same 
sentence or no more than N words between two words) 
and stemming. 

Existing techniques often try to achieve similar goals: to
allow the user to better express the variability of the
natural language in which the string expression is to be
searched, in order not to miss any place where this
expression appears.

However, known techniques suffer from several drawbacks:
the end user has to learn the query language proposed by
the search engine; no two search engines have the same
query language; if the user does not think of all the
possible variations of the searched expression, he can miss
some relevant documents; and, on the other hand, if the
search expression is too "loose", many irrelevant documents
will be retrieved, generating noise.

The present invention provides a method of searching for
information in a text database, comprising: (a) receiving
at least one user input, the user input(s) defining a
natural language expression including one or more words,
(b) converting the natural language expression to a tagged
form of the expression, the tagged form including said one
or more words and, for the or each word, a part-of-speech
tag associated therewith, (c) applying to the tagged form
one or more grammar rules of the language of the natural
language expression, to derive a regular expression, and
(d) analysing the text database to determine whether there
is a match between said regular expression and a portion of
said text database.

Preferably, step (b) comprises the step of: tagging the
natural language expression by, for the or each word in
said natural language expression, (b1) converting the word
to its root form, and (b2) applying a part-of-speech tag to
the word, thereby generating a complex tagged form.

Preferably, the part-of-speech tag includes a syntactic
category marker and a morphological feature marker, and
wherein step (b) further comprises the step of: (b3)
simplifying said complex tagged form by removing the or
each morphological feature marker, to generate said tagged
form.

Preferably, the method further includes the step of (e)
determining the location of said text database of a match
with said regular expression.

The invention further provides a programmable data
processing apparatus when suitably programmed for carrying
out the method of any of the appended claims, or according
to any of the particular embodiments described herein, the
processor being coupled to the memory and user interface,
and being operable in conjunction therewith for executing
instructions corresponding to the steps of said method(s).

The linguistic search techniques according to the present
invention overcome at least some of the above-mentioned
problems. They rely both on linguistic tools (such as a
tokeniser, morphological analyser and disambiguator) and on
the generation of complex regular expressions to match
against the text database.

This mechanism has the advantage over a basic full text
search engine that the end user does not need to learn an
esoteric query language. He just has to type the multiword
expression he is looking for in natural language.

A further advantage is that the retrieved documents will be
much more relevant to the query from a linguistic point of
view (although this does not ensure that all documents
relevant from the point of view of the meaning will be
retrieved).

A further advantage is that many variations will be
captured by the linguistic processing. As a consequence,
even a user who is not familiar with the language in which
the searched documents are written does not have to know
about the linguistic variations that might occur.

The linguistic search techniques according to the invention
provide a new way to search for information in a text
database. They enable users to find portions of a text
which match multiword expressions given by the user.
Matches include possible variations that are relevant to
the initial criteria from a linguistic point of view,
including simple inflections like plural/singular,
masculine/feminine or conjugated verbs, and even more
complex variations like the insertion of additional
adjectives, adverbs, etc. in between the words specified by
the user. This technique can complement conventional full
text search engines by reducing the number of retrieved
documents that are inconsistent with the query.

Embodiments of the invention will now be described, by way
of example, with reference to the accompanying drawings, in
which:

Figure 1 is a schematic block diagram of the computer which
may be used to implement the techniques according to an
embodiment of the present invention; and

Figure 2 is a schematic flow diagram of the steps in
carrying out a linguistic search according to an embodiment
of the present invention.

It will be appreciated that the present invention may be
implemented using conventional computer technology. The
invention has been implemented in Perl and C++ on a Sun
workstation running SunOS. It will be appreciated that the
invention may be implemented using a PC running Windows™, a
Mac running MacOS, or a minicomputer running UNIX, which
are well known in the art. For example, the PC hardware
configuration is discussed in detail in The Art of
Electronics, 2nd Edn, Ch. 10, P. Horowitz and W. Hill,
Cambridge University Press, 1989, and is illustrated in
Fig. 1. Stated briefly, the system comprises, connected to
common bus 30, a central processing unit 32, memory devices
including random access memory (RAM) 34, read only memory
(ROM) 36 and disk, tape or CD-ROM drives 38, keyboard 12
(not shown), mouse 14 (not shown), printing, plotting or
scanning devices 40, and A/D, D/A devices 42 and digital
input/output devices 44 providing interfacing to external
devices 46 such as the rest of a LAN (not shown).

Figure 2 is a schematic flow diagram of the steps 
performed in carrying out a linguistic search according 
to an embodiment of the present invention. 

It will be apparent to persons skilled in the art that 
where references are made herein to steps, operations 
or manipulations involving characters, words, passages 
of text, etc., these are implemented, where appropriate, 
by means of software controlled processor operations 
upon machine readable (e.g. ASCII code) representa- 
tions of such characters, words and text. 

For the sake of illustrating the techniques according to
the invention, the case is considered where the French
expression "système distribué" (equivalent to "distributed
system" in English) is to be searched for in a French
corpus by the user.

Initially (step s1), the user specifies the multiword 
expression he is looking for, for example, using the type 
of graphical user interface which is well known in the art. 
There is no need to pay attention to the formulation of 
this expression: nouns and/or adjectives can be plural 
or singular, verbs can be conjugated, etc. 

Next, at step s2, the expression is sent to the tagger (or
disambiguator), such as those available from Xerox Corp.
Taggers are discussed in more detail in McEnery T. and
Wilson A., Corpus Linguistics, Ch. 5, section 3 and
Appendix B. The tagger (or disambiguator) does two things:

(1) reduce each word to its root form (e.g. distribué
becomes distribuer, the infinitive form of the verb), and

(2) determine the part-of-speech of each word (e.g.
système is a singular noun, NOUN_SG, and distribué is a
singular adjective, ADJ_SG).

NOUN_SG and ADJ_SG are called tags. Each tag is made of two
parts: the syntactic category (or part-of-speech, like
NOUN, ADJ, VERB, etc.) and the morphological feature (like
SG, PL, etc.), which reflects the inflection of the word.
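By way of illustration only, the two operations performed at step s2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the lexicon, its entries and the function name tag_expression are hypothetical, accents are omitted in the toy data, and a plain table lookup stands in for a real tagger, which would also disambiguate each word from its context.

```python
# Toy lexicon (an assumption for this sketch, not the output of a real tagger):
# surface form -> (root form, syntactic category, morphological feature).
LEXICON = {
    "systeme": ("systeme", "NOUN", "SG"),
    "systemes": ("systeme", "NOUN", "PL"),
    "distribue": ("distribuer", "ADJ", "SG"),
    "distribues": ("distribuer", "ADJ", "PL"),
}


def tag_expression(expression):
    """Return the complex tagged form of a whitespace-separated expression."""
    tagged = []
    for word in expression.lower().split():
        root, category, feature = LEXICON[word]
        # (1) the word is replaced by its root form,
        # (2) the tag is built from category and morphological feature.
        tagged.append(f"{root}+{category}_{feature}")
    return " ".join(tagged)


print(tag_expression("systeme distribue"))
```

With this toy lexicon the singular and plural spellings of the query reduce to the same root forms and differ only in the morphological part of the tag.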

Once the tagged form 50 has been obtained, it is simplified
at step s3: because it is desired that the linguistic
search process retrieve all possible inflections of a word,
each tag is first reduced to its syntactic category. The
gender, number or person of a word is of no use for the
linguistic search, and is removed. Preferably, this
comprises replacing each of "SG", "PL", etc. with a neutral
symbol (*) so as to encompass all possibilities of
morphological feature.
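The simplification of step s3 amounts to a textual substitution on the tagged form. The following Python sketch shows one possible way of doing it; the set of feature markers (SG, PL, MASC, FEM) and the name simplify are assumptions made for the example, and a real tag set would contain more markers.

```python
import re

# Morphological feature markers assumed for this sketch.
FEATURE_PATTERN = re.compile(r"_(?:SG|PL|MASC|FEM)\b")


def simplify(tagged_form):
    """Replace each morphological feature marker with the neutral symbol '*'."""
    return FEATURE_PATTERN.sub("*", tagged_form)


print(simplify("systeme+NOUN_SG distribuer+ADJ_SG"))
```

Tags that carry no morphological feature are left unchanged by this substitution.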

The process continues at step s4, in which the simplified
tagged form 51 is operated on. Given the grammar of a
language, it is possible to determine what kind of
variations a multiword expression can undergo without
changing its initial meaning. The following discussion
presents some of the rules that have been used for French
to generate variations around nominal phrases:

(1) In between a noun and an adjective one can insert
adjectives, adverbs, or past participles, possibly
connected by co-ordinating conjunctions like et (and), ou
(or), etc. Figure 2 illustrates the application of this
rule to the expression systèmes distribués and shows a
simplified version of the resulting regular expression (the
symbol ⊗ represents the word preceding the tag). As an
example, there follow some linguistic variations caught by
this regular expression:

=> systèmes distribués (distributed systems - plural form)

=> systèmes relationnels distribués (distributed
relational systems - inserted adjective)

=> système redondant et totalement distribué (fully
redundant and distributed system - inserted adjective and
adverb joined by a co-ordinating conjunction)

(2) In between a noun and a preposition or a preposition
and a noun, additional adjectives can be inserted.

(3) In between 2 nouns, additional adjectives can be
inserted.

The rules listed above apply to French noun phrases. They
can be extended to any other kind of phrase, including
those containing verbs, and also to any other language.
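Rule (1) can be sketched in Python as a function that turns a simplified noun and adjective pair into a compiled regular expression over tagged text. This is an illustration under assumptions, not the implementation described here: the function names are hypothetical, the tagged text is assumed to be written as root+CATEGORY_FEATURE tokens separated by spaces, and accents are omitted. The nested expression of Figure 2 accepts the same token sequences as a flat alternation over the four insertable categories, so the flat form is used.

```python
import re

# Zero or more inserted words tagged ADJ, ADV, PAP or COORD, each followed by
# whitespace. Equivalent to the nested form ((ADJ|ADV|PAP)* (COORD)*)*.
INSERTABLE = r"(?:[^\s+]+\+(?:ADJ|ADV|PAP|COORD)(?:_\w+)?\s+)*"


def token_pattern(simplified_token):
    """Turn 'systeme+NOUN*' into a pattern accepting any morphological feature."""
    root, category = simplified_token.rstrip("*").split("+")
    return re.escape(root) + r"\+" + category + r"(?:_\w+)?"


def noun_adjective_rule(simplified_form):
    """Build the regular expression for a 'noun adjective' simplified form."""
    noun, adjective = simplified_form.split()
    if not (noun.endswith("+NOUN*") and adjective.endswith("+ADJ*")):
        raise ValueError("rule (1) expects a noun followed by an adjective")
    return re.compile(
        r"(?<!\S)" + token_pattern(noun) + r"\s+" + INSERTABLE
        + token_pattern(adjective) + r"(?!\S)"
    )


pattern = noun_adjective_rule("systeme+NOUN* distribuer+ADJ*")
example = "systeme+NOUN_SG redondant+ADJ_SG et+COORD totalement+ADV distribuer+ADJ_SG"
print(bool(pattern.search(example)))
```

The three variations listed under rule (1) are accepted by the generated expression, whereas a sequence containing an intervening preposition and noun is not, since rule (1) alone does not allow those categories.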

It is to be noted that these rules can be almost as complex
as desired if it is thought that there is a good chance for
the selected portion of text to be still relevant to the
initial query. For instance one could allow the insertion
of a new noun phrase in between the noun and the adjective,
as in "système à tolérance de panne distribué" (distributed
fault tolerant system), or, more complex still, the
insertion of a relative clause, as in "un système qui, par
nature, est totalement distribué" (a system which, by
essence, is fully distributed).

The grammar rules expressed in step s4 are coded in a
regular expression and matched against the simplified
tagged form 51 of the user query. If one of those rules
matches, then the simplified tagged form 51 of the user
query is transformed into a complex regular expression
representing the grammar variations.

Each rule is applied in sequence and only once, in order to
avoid the recursive application of a grammar rule to itself
or to others.
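One way to picture this "in sequence and only once" application is as an ordered list of rewrite rules, each run in a single pass over the simplified tagged form. The Python sketch below is an assumption about how such a rule table might look, not the actual coding of the rules: the names RULES and apply_rules are hypothetical, only rules (1) and (3) are shown, and the character '@' is used as an ASCII stand-in for the "any word" symbol of Figure 2.

```python
import re

# Gap inserted by rule (1), written in the notation of Figure 2 with '@'
# standing for "any word" (an ASCII substitute chosen for this sketch).
ADJ_GAP = "((@+ADJ|@+ADV|@+PAP)* (@+COORD)*)*"

# Ordered rules: (pattern over the simplified tagged form, replacement).
# The lookahead leaves the second word in place so that it can start the
# next match within the same pass.
RULES = [
    # Rule (1): noun followed by adjective.
    (re.compile(r"(\S+\+NOUN\*) (?=\S+\+ADJ\*)"), r"\1 " + ADJ_GAP + " "),
    # Rule (3): noun followed by noun.
    (re.compile(r"(\S+\+NOUN\*) (?=\S+\+NOUN\*)"), r"\1 (@+ADJ)* "),
]


def apply_rules(simplified_form):
    """Apply each rule once, in order; sub() never rescans its own output."""
    result = simplified_form
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


print(apply_rules("systeme+NOUN* distribuer+ADJ*"))
```

Because each substitution is a single left-to-right pass and the inserted gaps contain no noun tags, a rule in this sketch cannot fire again on text that it, or an earlier rule, has just produced.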

The matching regular expression 52 is then proc- 
essed further at step s5. Once the final regular expres- 
sion 52 has been generated it is matched against the 
tagged version of the corpus. With respect to this step, 
it is important to note the following. 

(1) As stated above, the matching process has to be
performed on a tagged version of the text corpus. This may
be done using a tagger, such as that available from Xerox
Corp., as mentioned above. The tagging phase can be
performed either on the fly, if the text varies frequently,
or once and for all if it is stable.

(2) If the corpus is large, a simple sequential search 
on the tagged text will take too much time. To speed 
up this phase a full text indexing engine can be 
used. But instead of indexing the original text as 
most full-text search engines do, the indexing 
mechanism is applied to the tagged version of the 
text corpus. 

(3) Most existing full-text indexing engines cannot handle
search queries expressed with complex regular expressions.
As a consequence, the expression generated by the
linguistic search system according to the present invention
cannot be given as is to the search engine. Instead, a
preliminary search is made on the individual words of the
simplified tagged expression (see step s3). Depending on
how sophisticated the indexing engine is, it can provide
the user with very basic information like the name of the
files in which those words have been found (like the
glimpse search engine does) or much more accurate
information like the position of the sentence in which
those words were found (like the Xerox PARC Text Database
(TDB) does). This preliminary step reduces the scope of
relevant (portions of) documents and decreases the time
required by the regular expression matching process.

(4) The current implementation of an embodiment of the
linguistic search system according to the invention is
based on the regular expression conventions of Perl (or any
flavour of awk). It will be appreciated by persons skilled
in the art that it could be easily transposed to the
regular expression formalism used by the Finite State
Transducers developed by Xerox Corp. (see EP-A-583,083).
The matching mechanism is based on the regular expressions
of Perl rather than the Finite State Transducers developed
by Xerox because Perl (and awk) tell the user not only what
portion of the text matched but also where it is located in
the corpus. This information is particularly useful for
highlighting the places where a match occurred. This
feature has two advantages:

(1) it avoids leafing through long documents to find places
where matches occurred (see step s6 discussed below); and

(2) it shows the whole matching multiword expression, which
can be quite different from the one typed by the user if
the linguistic variations allowed by the grammar rules are
complex.
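The combination of a preliminary index lookup with regular expression matching that also reports match positions can be sketched as follows. This is a hedged illustration in Python rather than Perl: the toy tagged sentences, the function names and the sentence-level granularity of the index are all assumptions, and the compiled pattern is a hand-written equivalent of the expression of Figure 2.

```python
import re
from collections import defaultdict

# Toy tagged corpus, one sentence per entry (illustrative data only).
TAGGED_SENTENCES = [
    "le+DET_SG systeme+NOUN_SG relationnel+ADJ_SG distribuer+ADJ_SG",
    "un+DET_SG systeme+NOUN_SG de+PREP fichier+NOUN_PL",
    "le+DET_PL donnee+NOUN_PL distribuer+ADJ_PL",
]

PATTERN = re.compile(
    r"(?<!\S)systeme\+NOUN(?:_\w+)?\s+"
    r"(?:[^\s+]+\+(?:ADJ|ADV|PAP|COORD)(?:_\w+)?\s+)*"
    r"distribuer\+ADJ(?:_\w+)?(?!\S)"
)


def build_index(sentences):
    """Map each root form to the set of sentence numbers that contain it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for token in sentence.split():
            index[token.split("+")[0]].add(number)
    return index


def search(index, sentences, roots, pattern):
    """Preselect sentences containing every root, then match the regex there."""
    candidates = set.intersection(*(index.get(root, set()) for root in roots))
    hits = []
    for number in sorted(candidates):
        match = pattern.search(sentences[number])
        if match:
            # Report what matched and where, as needed for highlighting.
            hits.append((number, match.start(), match.end(), match.group(0)))
    return hits


index = build_index(TAGGED_SENTENCES)
print(search(index, TAGGED_SENTENCES, ["systeme", "distribuer"], PATTERN))
```

In this toy corpus only the first sentence contains both root forms, so the regular expression is evaluated on one sentence instead of three.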

10 

Step s6 is performed after the regular expression has been
matched against the tagged version of the corpus. As
mentioned above, the Perl (or awk) regular expression
mechanism can tell the user not only what string matches
but also where this string is located in the text. However,
because according to the invention the regular expression
matching is done on the tagged version of the corpus, the
positioning information is not suitable for the original
text. As a consequence, if it is desired to highlight the
matches, a way must be provided to go from the offset in
the tagged text to the actual offset in the original text.
Currently, this is done via a simple offset table built
during the corpus tagging.
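An offset table of this kind can be sketched in Python as a list of entries recorded while the text is tagged. The layout of the table, the toy lexicon and the function names are assumptions made for the example; the text is assumed to be tokenised on whitespace, with no punctuation, and accents are omitted.

```python
import bisect
import re

# Toy lexicon: surface form -> (root form, tag). Illustrative data only.
LEXICON = {
    "les": ("le", "DET_PL"),
    "systemes": ("systeme", "NOUN_PL"),
    "relationnels": ("relationnel", "ADJ_PL"),
    "distribues": ("distribuer", "ADJ_PL"),
}


def tag_with_offsets(text, lexicon):
    """Tag `text` and record, per token, (tagged start, original start, original end)."""
    tagged_parts, table, position = [], [], 0
    for match in re.finditer(r"\S+", text):
        root, tag = lexicon[match.group(0).lower()]
        token = f"{root}+{tag}"
        table.append((position, match.start(), match.end()))
        tagged_parts.append(token)
        position += len(token) + 1  # +1 for the separating space
    return " ".join(tagged_parts), table


def to_original_span(table, tagged_start, tagged_end):
    """Convert a match span in the tagged text into a span in the original text."""
    starts = [entry[0] for entry in table]
    first = bisect.bisect_right(starts, tagged_start) - 1  # token holding the start
    last = bisect.bisect_left(starts, tagged_end) - 1      # last token before the end
    return table[first][1], table[last][2]


TEXT = "Les systemes relationnels distribues"
tagged, table = tag_with_offsets(TEXT, LEXICON)
match = re.search(r"systeme\+NOUN\S*(?:\s+\S+\+ADJ\S*)+", tagged)
start, end = to_original_span(table, match.start(), match.end())
print(TEXT[start:end])
```

The span returned for the original text can then be used directly to highlight the matched words, including any inserted words that the user did not type.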

It will be appreciated that numerous modifications may be
made in implementing the techniques according to the
invention.

The linguistic search could be applied to Web search
engines. Although their query languages tend to be more and
more sophisticated, they are not yet close to a linguistic
search.

The process explained above assumes that the corpus to be
searched is first disambiguated (or tagged). However, it
will be appreciated that it would be possible to use the
techniques according to the invention as a front-end to a
Web search engine, for example. Here, the requirement is to
generate all the possible forms of a word and search for
all of them with a conventional search engine (or at least
for the substring which is common to all the derived forms
of a word). Then the selected documents need to be
retrieved for further processing (tagging) before the
linguistic search can be applied.
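The front-end variant can be illustrated with a short Python sketch that derives, for each root form, the list of inflected forms and a substring shared by all of them. The table of inflections and the function names are hypothetical, and the common prefix is used here as one simple instance of a "substring which is common to all the derived forms"; other choices are possible.

```python
# Hypothetical table of inflected forms per root form (accents omitted).
INFLECTIONS = {
    "systeme": ["systeme", "systemes"],
    "distribuer": ["distribue", "distribuee", "distribues", "distribuees"],
}


def common_prefix(forms):
    """Return the longest prefix shared by every form in the list."""
    prefix = forms[0]
    for form in forms[1:]:
        while not form.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def conventional_queries(roots):
    """For each root, give all forms to search for and their shared substring."""
    return {
        root: {
            "any_of": INFLECTIONS[root],
            "substring": common_prefix(INFLECTIONS[root]),
        }
        for root in roots
    }


print(conventional_queries(["systeme", "distribuer"]))
```

Documents returned by the conventional engine for these forms would still have to be tagged before the regular expression of step s4 could be matched against them.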
Claims

1. A method of searching for information in a text
database, comprising:

(a) receiving at least one user input, the user input(s)
defining a natural language expression (49) including one
or more words,

(b) converting the natural language expression to a tagged
form (50, 51) of the expression, the tagged form including
said one or more words and, for the or each word, a
part-of-speech tag associated therewith,

(c) applying to the tagged form (51) one or more grammar
rules of the language of the natural language expression
(49), to derive a regular expression (52), and

(d) analysing the text database to determine whether there
is a match between said regular expression (52) and a
portion of said text database.

2. The method of claim 1, wherein step (b) comprises the
step of:
tagging the natural language expression by, for the or each
word in said natural language expression, (b1) converting
the word to its root form, and (b2) applying a
part-of-speech tag to the word, thereby generating a
complex tagged form (50).

3. The method of claim 2, wherein the part-of-speech tag
includes a syntactic category marker and a morphological
feature marker, and wherein step (b) further comprises the
step of: (b3) simplifying said complex tagged form (50) by
removing the or each morphological feature marker, to
generate said tagged form (51).

4. The method of claim 1, 2 or 3, further including the
step of (e) determining the location of said text database
of a match with said regular expression (52).

5. A programmable data processing apparatus when suitably
programmed for carrying out the method of any of the
preceding claims, the apparatus including a processor,
memory, and a user interface, the processor being coupled
to the memory and user interface, and being operable in
conjunction therewith for executing instructions
corresponding to the steps of said method(s).










[FIG. 1: block diagram of the computer. Connected to common
bus 30: central processing unit 32 (bus control, registers,
stack pointer, instruction decode, program counter,
arithmetic logic unit, cache, flags); RAM 34; ROM 36; disk,
tape, CD-ROM 38; keyboard 12; mouse 14; A/D and D/A
converters 42; digital I/O 44; LAN, etc. 46]

FIG. 1






[FIG. 2: flow diagram of the linguistic search]

s1: The user enters the expression in natural language
    49: systeme distribue

s2: The expression is disambiguated (or tagged)
    50: systeme+NOUN_SG distribuer+ADJ_SG

s3: The tagged form is simplified to remove gender, number,
    etc.
    51: systeme+NOUN* distribuer+ADJ*

s4: Apply relevant grammar rules to the simplified tag form
    and generate the matching regular expression
    52: systeme+NOUN* ((⊗+ADJ|⊗+ADV|⊗+PAP)* (⊗+COORD)*)*
        distribuer+ADJ*

s5: The regular expression is matched against the tagged
    version of the corpus

s6: The matched expression is pinpointed in the original
    text and highlighted

FIG. 2